Genome-wide identification of associations between enhancer and alternative splicing in human and mouse

Background Alternative splicing (AS) increases the diversity of transcriptome and could fine-tune the function of genes, so that understanding the regulation of AS is vital. AS could be regulated by many different cis-regulatory elements, such as enhancer. Enhancer has been experimentally proved to regulate AS in some genes. However, there is a lack of genome-wide studies on the association between enhancer and AS (enhancer-AS association). To bridge the gap, here we developed an integrative analysis on a genome-wide scale to identify enhancer-AS associations in human and mouse. Result We collected enhancer datasets which include 28 human and 24 mouse tissues and cell lines, and RNA-seq datasets which are paired with the selected tissues. Combining with data integration and statistical analysis, we identified 3,242 human and 7,716 mouse genes which have significant enhancer-AS associations in at least one tissue. On average, for each gene, about 6% of enhancers in human (5% in mouse) are associated to AS change and for each enhancer, approximately one gene is identified to have enhancer-AS association in both human and mouse. We found that 52% of the human significant (34% in mouse) enhancer-AS associations are the co-existence of homologous genes and homologous enhancers. We further constructed a user-friendly platform, named Visualization of Enhancer-associated Alternative Splicing (VEnAS, http://venas.iis.sinica.edu.tw/), to provide genomic architecture, intuitive association plot, and contingency table of the significant enhancer-AS associations. Conclusion This study provides the first genome-wide identification of enhancer-AS associations in human and mouse. The results suggest that a notable portion of enhancers are playing roles in AS regulations. The analyzed results and the proposed platform VEnAS would provide a further understanding of enhancers on regulating alternative splicing. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08537-1.

The regulation of AS relies on numerous cis-regulatory elements, including cis-acting splicing regulatory elements (SREs), splicing motifs, and enhancers. SREs include exonic/intronic splicing enhancers or silencers. Wang et al. had conducted a systematical method for the identification of these SREs [9]. Some splicing motifs have been reported to be correlated with regulation of AS. For example, Holste et al. had provided a computational framework to identify splicing motifs and to predict AS events [10]. Enhancer had also been reported to correlate to AS changes [11][12][13].
Enhancer is a cis-regulatory element known as its characteristics: high abundance in genome, regulating genes in highly variable location, and lack of discriminative DNA sequence [14]. Enhancers have been demonstrated to physically interact with promoter and polymerase during transcription elongation [15,16]. This physical interaction shortens the distance between enhancer and gene body, and further grants enhancers an opportunity to influence AS. Previous studies had demonstrated that enhancer can affect alternative splicing. For example, the insertion of the SV40 transcriptional enhancer is capable of inhibition of inclusive form of fibronectin extra domain I [11]. Another one example is that the downstream enhancer of protocadherin alpha can loop back to bind with promoter by coupling of CTCF and further affect AS [17]. These studies had shown that enhancer is capable of affecting AS events.
A previous study suggested that most of enhancers are inactive (poised) until the proper factor binds on it [18]. Thus, it is challenging for biologists to design a highthroughput experiment to identify the enhancer-AS associations. Because there is no genome-wide study to identify the associations, in this context, we developed a bioinformatics pipeline (Fig. 1A) to find out the significant enhancer-AS associations on a genome-wide scale by analyzing large amount of human and mouse transcriptomes. We further constructed a platform entitled VEnAS (Visualization of Enhancer-associated Alternative Splicing) to present the enhancer-AS associations.

Data selection and preparation
We downloaded enhancer datasets which include 28 human and 24 mouse tissues and cell lines from  Fig. 1 (A)Workflow or the analysis pipeline for identification of enhancer-AS associations. In the top left part of the analysis pipeline, we focused on enhancer datasets polish, including position refining and presence/absence calling. In the top right part, we focused on the processes for quantification of AS changes. We then conducted association analysis to identify enhancer-AS associations, and finally constructed a website called VEnAS for data visualization. (B) An example of refining enhancers between different tissues and cell lines.The blue boxes are representing to enhancers in different tissues or cell lines. The middle positions of enhancers are used for hierarchical clustering with centroid method. The cuttree threshold is set as 3 kilo bases. The green and orange boxes are representing to the two refined groups under the threshold enhancerAtlas [19]. These tissues and cell lines were chosen because they have at least three paired RNA-seq datasets for quantification of AS. To prevent the data imbalance, we down-sampled the number of RNA-seq datasets to three. We then downloaded the chosen 84 human (28*3) and 72 mouse (24*3) RNA-seq fastq files from Sequence Read Archive (SRA) [20]. These fastq files were mapped onto the latest genome (GRCh38 for human and GRCm38 for mouse) by HISAT2 [21] with default parameters.

Enhancer calling
The boundaries of enhancer could be incongruent due to tissue characteristics, enhancer calling methodologies, or batch effects from input data sets. Thus, refining the location of enhancers between different tissue types is required to eradicate the incongruence. To refine the enhancers between different tissues and cell lines, we took advantage of agglomerative hierarchical clustering with centroid method (Fig. 1B). We used the central position of each enhancer as input for hierarchical clustering.
Previous studies had reported that the length of enhancer is ranged between 2-4 kilo bases [18,[22][23][24]. Thus, we set 3 kilo bases as a threshold to limit the growth of the clusters. After refining the location of enhancer, we were able to call the present or absent of enhancer between different tissues based on whether there is any enhancer located in the refined range.

Quantification and categorization of AS
The v94 human and mouse genome annotations were downloaded from Ensembl. CATANA [25] was used upon the human and mouse genome annotation to obtain the latest version of AS annotation. The latest AS annotation and the mapped bam files (from data preparation) were used for MISO [26] to compute percent splice in (PSI), which is an inclusion index based on the number of junction reads [27]. The equation of PSI is defined as To guarantee the AS changes of a given AS event from human 84 or mouse 72 samples are large enough, we removed the AS events with the PSI range across all samples less than 0.1. After that, we conducted Z-transformation upon all PSI values across tissues to capture the changes of a given AS between tissues. To categorize whether a tissue does have an AS change, the tissue having Z-transformed PSI value (Z-PSI) larger than 1 is defined as "inclusive shift", while the tissue having Z-PSI smaller than -1 is defined as "exclusive shift".

PSI =
Junction reads supporting to inclusive form Junction reads supporting to inclusive form + Junction reads supporting to exclusive form

Association analysis
With the labels present/absent of enhancers and inclusive/exclusive shift of AS changes, for each enhancer-AS pair we can generate a two-by-two contingency table containing the number of samples in the four cells. We removed enhancer-AS pairs having low strength of association in the contingency table to improve the precision of the association analysis and reduce the false results. Thus, we only included the enhancer-AS pairs for analysis in which the odds ratio must be larger than 2 or less than 0.5 accompanied by the effective size constrain (the number difference between concordant and discordant cells must be larger than 10). Then the Fisher exact test was conducted exhaustively throughout all the enhancer-AS pairs to calculate the p-value. All the p-values were then adjusted by Benjamini-Hochberg procedure false discovery rate (FDR) to obtain q-values. An enhancer-AS association was considered significant if the q-value is smaller than 0.05.

Implementation of VEnAS
The VEnAS database was written by a combination of Perl, Python, and R for data processing and statistical analysis. The web server of VEnAS was implemented with a combination of PHP, Google Polymer framework, and MySQL on Ubuntu server. For efficiently storing and querying, the analysis result and other integrated data were subjected to database normalization. The schema of the normalized MySQL database table is shown in the Figure S1. The tables holding PSI and genomic location of enhancer were separated for parallel querying by MySQL. In addition, the table holding index for autocompletion during user query is shown on the top-left side of Figure S1. The keywords used for constructing index include Ensembl gene accession, gene symbol, and gene description.

Results
To identify the enhancer-AS associations on a genomewide scale, we developed an analysis pipeline (Fig. 1A, detailed in Methods). We first curated the enhancer profiles and RNA-seq datasets of 28 human and 24 mouse tissues and cell lines for analysis. Since the profile of active enhancer is naturally varied between different tissues and cell lines [28], we refined the boundaries of enhancers and generated enhancer calling using the hierarchical clustering method. We then used CATANA and MISO to quantify and categorize AS events from RNA-seq datasets. To further check the similarity or overlapping event between different samples, we computed the Jaccard coefficient index ( Figure S2). The result shows that the enhancer-AS events are quite similar within the triplicated samples under the same tissue type but different between tissues. The Fisher exact test was performed to identify the significant enhancer-AS associations with present/absent of enhancer and inclusive/exclusive shift of AS types.

Enhancer-AS associations in human and mouse
By conducting association analysis with absent/present of enhancer and inclusive/exclusive shift of AS event, we found that 3,242 human genes and 7,716 mouse genes have at least one significant enhancer-AS association, and 11,262 human enhancers and 26,083 mouse enhancers are participating in AS changes (Table 1 and Table 2). Previous study had mentioned that transcripts having alternative start and termination sites shape the major transcriptome diversity across human tissues [13]. As expected, in our results, the numbers of genes having associations between enhancers and the AS types regarding alternative transcription initiation and termination sites (AFE, ALE, ATSS and ATTS) are notably higher than the six canonical AS types (A5SS, A3SS, SE, RI, MSE, MXE) in human and mouse (Fig. 2). Table 1 The counting table of human genes and enhancer having enhancers having enhancer-AS association for different AS types. The row "All" represents the number of genes or enhancers having associations in any types of AS. The column "input" means the number of genes or enhancers which are qualified for the analysis. The column "significant" represents the number of genes or enhancers pass the q-value smaller than 0.05

AS type
Counting by genes Counting by enhancers  Table 2 The counting table of mouse genes and enhancer having enhancer-AS association for different AS types. The row "All" represents the number of genes or enhancers having associations in any types of AS. The column "input" means the number of genes or enhancers which are qualified for the analysis. The column "significant" represents the number of genes or enhancers pass the q-value smaller than 0.05 Gene and enhancer are many-to-many relationship [29]. One given gene could be associated to multiple enhancers, and vice versa. Here we would like to know that under consideration of association with AS changes, how many enhancers are associated to one given gene and how many genes are associated to one given enhancer. We further interrogated the association relationship between enhancer and genes by examining the number of enhancers per gene (also the genes per enhancer). According to the annotation from enhancerAtlas, on average, each gene is paired with 60.32 enhancers in human and 68.47 enhancers in mouse. Our association analysis suggests that given one gene, on average, 3.88 of the 60.32 enhancers (6.43%) in human and 3.54 of the 68.47 enhancers (5.17%) in mouse are associated to AS change ( Figure S3A and S3B). For enhancers, on average each one enhancer is paired to 7.66 genes in human and 9.29 genes in mouse according to enhancerAtlas, but in our result one enhancer is significantly associated to AS change with only 1.28 genes human and 1.22 genes in mouse ( Figure S3C and S3D).

Investigations of the genetic properties of identified enhancer-AS associations
To further understand the genetic properties of identified enhancer-AS associations, we observed the proportion of enhancer-AS associations which have both homologous genes and homologous enhancers between human and mouse. For each gene in human, we defined its homolog in mouse according to the homologs list provided in Mouse Genome Informatics (MGI) [30]. For each enhancer in human, we obtained its homologous enhancers in mouse by conducting the CrossMap [31] with the human and mouse chain file and, which is the pairwise alignment between two reference assemblies from Ensembl [32]. We found that about 52% of the significant and 35% of the insignificant enhancer-AS associations in human have homologous genes accompanied with homologous enhancers in mouse ( Table 3). The Welch two sample t-test shows significant difference (p-value = 5.56 × 10 -11 ) upon percentages of significant enhancer-AS pairs with homologous genes and enhancers in all ten types of AS against insignificant groups. This suggests that the significant enhancer-AS pairs are more likely to be the co-existence of homologous genes and homologous enhancers than insignificant enhancer-AS pairs. Similar trends with lower percentages were found when we check the significant enhancer-AS pairs (Welch two sample t-test p-value = 1.906 × 10 -13 ) in mouse (Table 4). These results show that the significant enhancer-AS associations we identified are more likely to be the co-existence of homologous genes accompanied with homologous enhancers in both human and mouse rather than conservation of enhancer sequence only.

Visualization of enhancer-AS associations
To visualize the enhancer-AS associations, we constructed a platform named VEnAS. VEnAS provides intuitive genomic architecture, association plot, and contingency table of all the significant enhancer-AS associations (Fig. 3). To query VEnAS, users can input Ensembl gene ID or gene symbol (Query 1 in Fig. 3). The auto-completion function would help users find out the gene of interests. The web server provides portable gene information for convenient linking to Ensembl, NCBI, and RefSeq (Result 1 in Fig. 3). After users select an AS type and a corresponding enhancer (Query 2 in Fig. 3), VEnAS shows the architecture of the gene with enhancer, association plot, and a two-by-two contingency table (Query 2 in Fig. 3). For splicing display, the bending curve drawn above exons represents the inclusive form of AS products, while the curve drawn below exons represents the exclusive form. The width of curves represents the number of biological replicates which support the association events. Moreover, the colors denote whether enhancer is active or inactive. In the top of a two-by-two contingency table, the FDR adjusted q-value of the Fisher exact test and the odds ratio are also provided. Inside the table, the color boxes are representing biological replicates having AS shifted to inclusion or exclusion. The color intensity of the boxes is proportional to the Z-PSI.
The tissue name and Z-PSI would be displayed when the mouse cursor is hovering atop the box. Additionally, VEnAS provides batch retrieval function. The user could send a list of Ensembl gene ID(s) obtained from any other analysis tool or software in the batch retrieval web page through pasting in dialog box or uploading file. VEnAS can convert the visualized results into PDF file format for users for further analyses.

Case study
We have identified lots of enhancer-AS associations in this study. However, it is difficult to find out large scale biological evaluation or literature evidence. Hence, we performed comparative genomics analysis between human and mouse as well as observed the splicing events to further evaluate the identified associations. Below is a case demonstrating the robustness of our finding. In Manduchi et al.'s study [33], they identified 35 significant SNP marks and enhancers which are associated to Type 2 diabetes with combination of epigenomic markers and genome wide association studies (GWAS). In their result, gene ST3GAL4 is associated to two SNP markers located within an enhancer which is named chr11_1460 in our Table 3 The counting table of human significant and insignificant enhancer-AS pairs accompanied with both homologous enhancers. The numbers of total significant enhancer-AS pairs, significant enhancer-AS pairs with homologous genes and enhancers, total insignificant enhancer-AS pairs, insignificant enhancer-AS pairs with homologous genes or enhancers in all ten types of AS are provided. The percentages of enhancer-AS pairs with homologous genes and enhancers in all ten types of AS are calculated   . 3 The webpage of VEnAS and query steps. Following by the queries (e.g. Query 1 and 2), users can obtain gene architecture, association plot, and detailed statistical information (e.g. Result 1 and 2) of VEnAS database conveniently system. As shown in Fig. 4A, the enhancer is marked by ENCODE as a cis-regulatory element in human. As shown in comparative genomics data track, the genomic region of this enhancer is located within synteny between human and mouse. In mouse, the associated enhancer is named chr9_3600, which is also marked as a cisregulatory element by ENCODE and within the syntenic region shared with human enhancer chr11_1460 (Fig. 4B). Furthermore, we utilized MISO to draw sashimi plots and PSI histograms [26] to illustrate that the presence/absence of the associated enhancer is associated to skipped exon event (SE) of ST3GAL4 in human (Fig. 4C). The PSI histograms show that all the PSI values are closed to "1", i.e. the inclusive form dominated, in the samples where the enhancer is present. On the contrary, the PSI values are decreased to about 0.5 in the samples when the enhancer is absent. The strength of association between enhancer and SE is significant ( q − value = 2.909x10 −2 , as shown in Fig. 4D). Taking together the literature evidence, comparative genomics data, and PSI distribution; we did successfully demonstrate the existence of the enhancer-AS association.

Discussions
Previous studies showed that some enhancers are conserved between human and mouse [34] while some enhancers might be reprogrammed after humanmouse speciation [35]. To investigate whether the enhancers associated to AS existed in human and mouse are conserved or not, we further examined the conservation score difference between significant and insignificant enhancers. The conservation score of enhancer sequence between human and mouse were downloaded from Ensembl v94 compara 32 amniotes  [36]. After comparison, we didn't find any difference of conservation score of enhancer sequence between significant and insignificant enhancer-AS associations (data not shown).
As we already know that enhancers could also serve as a hub for binding of transcription factors [18], we tried to annotate known motifs on enhancer regions by DREME [37] and TomTom [38] with position frequency matrix from JASPAR [39]. However, we didn't find any differentially enriched known motifs shared between human and mouse. Though we didn't find any advanced evidence, more data sets are required to conclude that the enhancers associated to AS are newly emerged or reprogrammed after human-mouse speciation.
In 2013, a new concept of super-enhancer had been proposed [40,41]. Super-enhancers are considered to be a cluster of several different enhancers with exceptional higher binding of transcriptional coactivators [40,41]. Super-enhancers are usually longer than typical enhancers, with a median length of 8.7 kb [42].
Recently, more and more super-enhancer databases about super-enhancer characteristics and associated genes are available, such as dbSUPER [43], SEdb [44], and SEA [45]. It has been reported that superenhancer is capable of regulating alternative splicing in smooth muscle [46]. However, our current statistical analysis method is designed for one enhancer on one AS event rather than multiple/combinatorial enhancers on one AS event. To pin-point the correlation between the combination of transcription factors and AS events requires a more sophisticated method. In the future, we will pursue a genome-wide method to reveal the correlation between super-enhancer and alternative splicing event.

Conclusion
In this study, an analysis pipeline to identify enhancer-AS associations was proposed. We included 84 RNAseq data sets across 28 tissues and cell lines in human and 72 RNA-seq data sets across 24 tissues and cell lines in mouse for analysis. In total, 3,242 human genes and 7,716 mouse genes having at least one significant enhancer-AS were identified. On average, about 5-6% of the enhancers of one given gene are associated to AS change, and one given enhancer is associated to 1.28 human or 1.22 mouse genes. The significant enhancer-AS associations are more likely to be the co-existence of homologous genes and homologous enhancers in both human and mouse. Finally, we constructed VEnAS to provide comprehensive enhancer-associated AS results for scientists, including genomic architecture, intuitive association plot, and contingency table. We believe that our study is helpful in further understanding the roles of enhancers on regulating alternative splicing.
Additional file 1: Figure S1. Thedetailed schema of data warehousing in MySQL. All the data tables includingcolumn names are illustrated. The primary keys for linking tables are depictedwith black lines. Figure S2. Thecomparison of sample similarity in skipped exon (SE).The Jaccardcoefficient index is pair-wisely computed to present the enhancer-AS similarityor overlapping between different samples. The number of enhancer-AS eventswhich have identical enhancer present/absent calling and the sameinclusive/exclusive AS shift are calculated, and then divided by the totalenhancer-AS events to compute the Jaccard coefficient index. The result showsthat the enhancer-AS events are different between tissues but quite similarwithin the triplicated samples under the same tissue type, except fetal stomachSRR980482 which has the lowest Jaccard index score comparing to the other twofetal stomach samples.